The primary focus of this exploratory analysis is to better understand the Black Friday dataset. This involves examining each variable individually and the relationships between multiple variables. This notebook begins by explaining the data at a high level and what it consists of. It then discusses the issues identified in the dataset and how the team proposes to resolve them. A large chunk of the analyses are different visualizations demonstrating the individual attributes and relationships between them. This analysis ends with a look into new ideas and visualizations of the data.
The day after Thanksgiving is commonly known as Black Friday. On Black Friday, many stores and retailers open extremely early and offer heavy promotional sales. Although Black Friday is known to offer the best prices on numerous products, it is very chaotic and sometimes dangerous, with crowds gathering hours and days in advance to snag such deals. Thus, there is no doubt Black Friday is generally the busiest shopping day of the year.
This dataset is originally from a competition hosted by Analytics Vidhya and was obtained from Kaggle. The dataset can be found at this link: https://www.kaggle.com/mehdidag/black-friday. The original competition revolved around a company wanting to understand the customer purchase behavior against various products of different categories. It focused specifically on building a model to predict the purchase amount against various products for each customer to help create personalized offers. However, Kaggle presents the data in a way that opens up the type of analysis that can be performed.
There are various possibilities on outcomes from the dataset. For regression, the focus will be predicting the purchase amount. The original competition on Analytics Vidhya evaluates the predictions submitted against the actual purchase amount, scoring the result on the root mean squared error (RMSE). This will demonstrate the prediction error in the regression model created. For future projects concerning classification and clustering, a success rate goal will be pre-determined by the team.
# Load the Black Friday dataset
import pandas as pd
import numpy as np
# Vizualisation
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
bf_df = pd.read_csv('../data/BlackFriday.csv')
bf_df.head()
The Black Friday dataset consists of 537,577 records and 12 attributes.
Below is a table consisting of all the variables in the dataset. The table includes the variable name, type of data, scale, and a description.
It is important to note that various variables true meaning has been masked from the dataset due to confidentiality, as indicated by (Masked) below. The variables will continue to be included in the analysis.
| Variable | Type | Scale | Values | Description |
|---|---|---|---|---|
| User_ID | int | ordinal | [(1000001,...,1006040), 5891 total] | Unique customer identification number |
| Product_ID | object | ordinal | [(P00000142,...,P0099942), 3623 total] | Unique product identification number |
| Gender | object | nominal | (0, 1) | Sex of user (M or F) |
| Age | object | interval | [(0-17,18-25,26-35,36-45,46-50,51-55,55+) | Range of customer age |
| Occupation | int | nominal | (0,1,2,3,4,5,6,7,8,9,10,11,12,...,20) | Job role (Masked) |
| City_Category | object | nominal | (A, B, C) | City customer resides in |
| Stay_In_Current_City_Years | object | nominal | (0, 1, 2, 3, 4+) | Number of years living in city |
| Marital_Status | int | nominal | (0, 1) | Married (Yes or No) |
| Product_Category_1 | int | nominal | (1,2,3,4,5,6,7,8,9,10,...,18) | Main product category (Masked) |
| Product_Category_2 | float | nominal | (NaN, 2,3,4,5,6,7,8,9,...,18) | Product may belong to second category (Masked) |
| Product_Category_3 | float | nominal | (NaN, 3,4,5,6,8,9,10,...,18) | Product may belong to third category (Masked) |
| Purchase | int | ratio | [(185,...,23948), 17959 total] | Amount spent on purchase (USD) |
bf_df.shape
The Black Friday dataset contains 537577 observations and 12 attributes.
print (bf_df.dtypes)
Total number of Unique User Ids in the Black Friday dataset
uniqueCount = pd.value_counts(bf_df.User_ID)
uniqueCount
uniqueCountList = pd.Series({'Unique User Ids': len(uniqueCount), 'Unique User Id Values': uniqueCount.index.tolist()})
uniqueCountList
Total number of Unique Product Ids in the Black Friday dataset
uniqueCount = pd.value_counts(bf_df.Product_ID)
uniqueCount
uniqueCountList = pd.Series({'Unique Product Ids': len(uniqueCount), 'Unique Product Id Values': uniqueCount.index.tolist()})
uniqueCountList
Total Number of unique age group in the Blak Friday Dataset
uniqueCount = pd.value_counts(bf_df.Age)
uniqueCount
uniqueCountList = pd.Series({'Unique Age Groups': len(uniqueCount), 'Unique Age Group Values': sorted(uniqueCount.index.tolist())})
uniqueCountList
uniqueCount = pd.value_counts(bf_df.Product_Category_1)
uniqueCount
uniqueCountList = pd.Series({'Unique Product_Category_1': len(uniqueCount), 'Unique Product_Category_1 Values': sorted(uniqueCount.index.tolist())})
uniqueCountList
uniqueCount = pd.value_counts(bf_df.Stay_In_Current_City_Years)
uniqueCount
uniqueCountList = pd.Series({'Unique Stay_In_Current_City_Years': len(uniqueCount), 'Unique Stay_In_Current_City_Years Values': sorted(uniqueCount.index.tolist())})
uniqueCountList
User_ID identifies the customer who purchased the products. There are multiple rows for the same user, which means the same user has done multiple purchases. This is not a mistake, as more records per customer provide more data to use in developing models. We can preserve the attribute for analysis and predictions instead of dropping the attribute.
Based on the information below, there are missing values for two attributes (Product_Category_2 and Product_Category_3). Product Category 2 is missing 166,986 values and Product Category 3 is missing 373,299 values. Since those values are for extra classifications of product category and have been assigned NaN, it denotes that there were no items purchased in those categories on those observations. We chose to assign a value of 0 to all the missing values in the two columns. To be consistent, we will convert the types of those product categories to integer as well. The 0 is a placeholder and does not affect the integrity of the data. There are Product_Category_1 values for every record, so we made the decision to move forward this way.
After replacing the NaNs with 0, the data was checked to verify there are no missing values.
Upon analysis of the variables, the Purchase attribute contains outliers. It is important to note that the majority of the variables are categorical and will not contain outliers. Upon examining the boxplot of the purchase amount, the values above 21,500 are considered outliers. The minimum value for Purchase is 185 and the maximum is 23,961. Since the outliers are not significantly larger, the analysis will be performed with and without the values to determine the best model.
print (bf_df.info())
# Replace NaNs with 0s in all columns
bf_df.fillna(0, inplace=True)
# Change the data type of Product_Category_2 and Product_Category_3
bf_df["Product_Category_2"] = bf_df["Product_Category_2"].astype(np.int64)
bf_df["Product_Category_3"] = bf_df["Product_Category_3"].astype(np.int64)
bf_df.info()
sns.boxplot(bf_df["Purchase"])
The following table shows the basic statistics for continuous variables. In the table below, the only continuous variable is the Purchase column. This information is also referenced when discussing outliers. I find it important to note that although the mean is 9333.86, the range of values is from 18 to 23961. It will be interesting to see the relationships with the categorial variables.
# Get statistics for the continuous attributes
bf_df.describe()
This plot explores Gender and indicates there are more male customers making purchases than female customers. This will be useful information as relationships between attributes are explored later.
sns.countplot(bf_df.Gender)
This plot explores Marital_Status and indicates unmarried customers are making more purchases than married customers. This could be due to a variety of reasons, such as age, income, and occupation. It will be interesting to see the relationship with the other attributes.
sns.countplot(bf_df.Marital_Status)
This plot explores City_Category and indicates the most purchases are made from customers that reside in city B. The purchases mabe by customers in City A and C are close, however, those in City C make the second most purchases. It will be interesting to see the correlation between city and purchase amount.
sns.countplot(bf_df.City_Category)
This plot explores Occupation and indicates that there are a variety of occupations making a high number of purchases. The ocupations of customers that stand out as making the most purchases are occupation 4, 0, and 7. It will be intersting to see the relationship between occupation and total or average purchase amount.
sns.countplot(bf_df.Occupation)
This plot explores Product_Category_1 as that is indicated as the main product category. This shows that products in the cateogries 5, 1, and 8 are the most popular, being purchased more than the other categories. It will be intersting to see the relationship between product category 1 and purchase amount as well as age to gain further insight.
sns.countplot(bf_df.Product_Category_1)
The pie plot below clearly displays the Gender variable, showing the percentage of unique female and male customers. This is the best visualization for the information, because it demonstrates there are significantly more male customers than female customers. To be precise, there are 1666 female customers and 4225 male customers in the dataset.
print("Total Number of Purchase Transactions by Male Customers : ",len(bf_df[bf_df.Gender=='M']))
print("Total Number of Purchase Transactions by Female Customers : ",len(bf_df[bf_df.Gender=='F']))
print("Total Number of Unique Female & Male User Ids in Dataset : ",bf_df.groupby('Gender')['User_ID'].nunique())
plt.pie(bf_df.groupby('Gender')['User_ID'].nunique(),labels=['Female','Male'],shadow=True,autopct='%1.1f%%',
colors=['pink','steelblue'])
plt.title('Percentage of Unique Female & Male Customers')
plt.show()
The below pie plot also examines Gender, but looks at the total purchase amount by gender. It seems to agree with the data above, that since there are more male customers, male customers will spend more money than female customers. Again, this plot clearly shows the differentiation between the two values of the variable. Male customers are responsible for 76.8 percent of the total purchase amount, while female customers are responsible for 23.2 percent.
print(bf_df.groupby('Gender')['Purchase'].sum())
plt.pie(bf_df.groupby('Gender')['Purchase'].sum(),labels=['Female','Male'],shadow=True,
autopct='%1.1f%%',colors=['pink','steelblue'])
plt.title('Total Purchase by Gender')
plt.show()
From the above pie chart it is clear that
28.3% of Females contributed 23.2% of total purchase.
71.7% Male contributed 76.8% of total purchase.
priceRange = [20000,15000,10000,5000,3000,1000]
for price in priceRange:
print("Total % of Purchase where Price Range >= ",str(price))
print(len(bf_df[bf_df.Purchase>=price])/len(bf_df)*100.0)
This analysis examines the total percent of purchases that fall in the defined price ranges. It is intersting that 83.4 percent of the purchases are greater than or equal to 5,000 dollars, and the percentage drops to 34.7 percent of the purchases being greater than or equal to 10,000 dollars. It is also important to note that total percent of purchases greater than or equal to 1,000 dollars is 98.7, which is a large majority.
import plotly.offline as py
import plotly.figure_factory as ff
import plotly.graph_objs as go
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
py.init_notebook_mode(connected=True)
from matplotlib import pyplot as plt
from plotly import tools
#Purchase amount in USD by City categories
labels = sorted(bf_df.City_Category.unique())
values = bf_df.groupby("City_Category").sum()["Purchase"]
colors = ['#E74B2C', '#26C05E', '#26A6CO']
trace = go.Pie(labels=labels, values=values,
hoverinfo='label+percent', textinfo='value',
textfont=dict(size=20),
title = 'Purchase Amount by City',
marker=dict(colors=colors,
line=dict(color='#000000', width=2)))
iplot([trace])
This plot displays the purchase amount for the different cities. A pie plot is the most appropriate, clearly demonstrating how much of the total comes from the respective cities. City B is seen to spend the most, which relates to the above bar plot that shows the customers in city B are responsible for making the highest number of purchases.
#Most Purchased Product Category by Counts
data = [go.Bar(
x=bf_df.Product_Category_1.value_counts().sort_index().index,
y=bf_df.Product_Category_1.value_counts().sort_index().values
)]
layout = go.Layout(
title='Most Purchased Product Category',
)
fig = go.Figure(data=data, layout=layout)
iplot(fig)
This bar plot indicates that the most purchased products belong to category 5. A bar plot clearly defines the purchases for each of the categories. The most number of purchases for category 5 is also equivalent to approximately 148,000 purchases.
sns.distplot(bf_df['Purchase'], hist=True, rug=True)
The plot below clearly shows the distribution of purchase amount, and the line of fit helps us see the pattern. The highest peak seems to be around 8,000 dollars by looking at the plot. The mean purchase amount is 9,333 dollars, which seems to agree with this plot. The purchase amounts seem to range between 0 and 15,000 dollars for the most part, with other peaks around 16,000 and 20,000 dollars. It will be interesting to see how this relates to the other attributes.
import plotly.offline as py
import plotly.graph_objs as go
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
py.init_notebook_mode(connected=True)
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
sns.set()
#sns.pairplot(bf_df, hue="Gender", height=2)
Pair plots are not very helpful because of the volume of data
male_observations = bf_df[bf_df['Gender'] == 'M']
female_observations = bf_df[bf_df['Gender'] == 'F']
male_purchases_by_age = bf_df.groupby(by=['Age','Gender'])
print(male_purchases_by_age.Purchase.sum())
female_purchases_by_age = female_observations.groupby("Age").Purchase
This analysis shows the purchase amount by gender per age group. This data will be better visualized in the next section.
Visualize relationships between attributes: Look at the attributes via scatter plots, correlation, cross-tabulation, group-wise averages, etc. as appropriate. Explain any interesting relationships.
male_purchase_percentages = male_observations.groupby("Age").Purchase.sum() / bf_df.Purchase.sum() * 100
female_purchase_percentages = female_observations.groupby("Age").Purchase.sum() / bf_df.Purchase.sum() * 100
trace0 = go.Scatter(x = male_purchase_percentages.index,
y = male_purchase_percentages,
name = 'Male',
mode = 'lines+markers')
trace1 = go.Scatter(x = female_purchase_percentages.index,
y = female_purchase_percentages,
name = 'Female',
mode = 'lines+markers')
data = [trace0, trace1]
layout = dict(title = 'Percentage of Purchases by Gender & Age Groups [1]',
xaxis = dict(title = 'Age Group'),
yaxis = dict(title = 'Purchase Percentage (%)')
)
figure = dict(data=data, layout=layout)
iplot(figure)
Based on the line chart, we observe that males have made more purchases than females, which agrees with the visualizations created above. Also, the age group of customers ranging from 26-35 years old has made the most purchases. For this age group, males have contributed 31 percent of the purchase amount and females have contributed 8.6 percent. The most purchases are made in the age range of 18-45, which tends to be a major chunk of the working age. The percent of purchases decreases as the age range passes 45. This may be due to being settled down. It is important to note that females account for a smaller percent of purchases across all age groups
#prdctCat1_purchases_by_age = bf_df.groupby(by=['Age','Product_Category_1'])
#print(prdctCat1_purchases_by_age.Purchase.sum())
outdflist =[]
cat1Values = sorted(pd.value_counts(bf_df.Product_Category_1).index.tolist())
for cat1 in cat1Values:
# create a subdf with desired columns:
subdf = bf_df[bf_df['Product_Category_1'] == cat1]
newSubDf = subdf[['Product_Category_1','Age','Purchase']].copy()
outdflist.append(newSubDf)
outdfCntlist =[]
for df in outdflist:
dfCnt = df.groupby("Age").Purchase.count() / bf_df.Purchase.count() * 100
outdfCntlist.append(dfCnt)
traceList = []
index=0
for cnt in outdfCntlist:
index+=1;
trace = go.Scatter(x = cnt.index,
y = cnt,
name = 'PrdctCat1 = '+str(index),
mode = 'lines+markers')
traceList.append(trace)
layout = dict(title = 'Percentage of Purchase Count by Product Category1 & Age Group',
xaxis = dict(title = 'Age Group'),
yaxis = dict(title = 'Purchase Count Percentage (%)')
)
figure = dict(data=traceList, layout=layout)
iplot(figure)
From the chart it looks like Product Category 1 = "5" has been purchased more across all age groups, but the age group 55+ seems to have purchased more products which belong to category 8. This may be due to a shift in priorities at different ages. Again, the largest percentage of purchase count is in the age range of 26-35, with the largest categories being 1 and 5.
# The dataset does not contain any data in the married age group 0-17 which could be because of the age limit for marriages in
# the US. We wii exclude that age group to make the plot look better.
married_observations = bf_df[bf_df['Marital_Status'] == 1]
unmarried_observations = bf_df[(bf_df['Marital_Status'] == 0) & (bf_df['Age'] != '0-17')]
married_purchase_averages = married_observations.groupby("Age").Purchase.mean()
unmarried_purchase_averages = unmarried_observations.groupby("Age").Purchase.mean()
trace0 = go.Scatter(x = married_purchase_averages.index,
y = married_purchase_averages,
name = 'Married',
mode = 'lines+markers')
trace1 = go.Scatter(x = unmarried_purchase_averages.index,
y = unmarried_purchase_averages,
name = 'Unmarried',
mode = 'lines+markers')
data = [trace0, trace1]
layout = dict(title = 'Average Purchase Amount by Marital Status & Age Groups',
xaxis = dict(title = 'Age Group'),
yaxis = dict(title = 'Average Purchase ($)')
)
figure = dict(data=data, layout=layout)
iplot(figure)
Based on the plot, customers in the 46-50 age group spend the least amount on purchases. For customers in the age range of 26-35 there is a constant purchase average, irrespective of marital status. Similarly, for the age range of 18-25, there is not a large difference based on marital status. We found the decline in average purchases from age 36-45 to 46-50 interesting, especially considering the incline from age 46-50 to 51-55. Customers in the age range 51-55, both married and unmarried, have the highest purchase amount average. This is fascinating, since we have seen before that the customers between 26-35 have the highest purchase amount total.
Identify and explain interesting relationships between features and the class you are trying to predict (i.e., relationships with variables and the target classification).
bf_df['Age'].value_counts()
The above data shows the unique number of customer age category from the black friday dataset. As seen above the age group 26-35 contributed to more sales/ made more purchase. One possible reason is this is group where more individuals would belong to active working group as well.
bf_df['Product_ID'].value_counts().nlargest(5)
The above are the top 5 most popular or frequently bought products.
prdct1_observations = bf_df[bf_df['Product_ID'] == 'P00265242']
prdct2_observations = bf_df[bf_df['Product_ID'] == 'P00110742']
prdct3_observations = bf_df[bf_df['Product_ID'] == 'P00025442']
prdct4_observations = bf_df[bf_df['Product_ID'] == 'P00112142']
prdct5_observations = bf_df[bf_df['Product_ID'] == 'P00057642']
prdct1_purchase_averages = prdct1_observations.groupby("Age").Purchase.count()
prdct2_purchase_averages = prdct2_observations.groupby("Age").Purchase.count()
prdct3_purchase_averages = prdct3_observations.groupby("Age").Purchase.count()
prdct4_purchase_averages = prdct4_observations.groupby("Age").Purchase.count()
prdct5_purchase_averages = prdct5_observations.groupby("Age").Purchase.count()
trace0 = go.Scatter(x = prdct1_purchase_averages.index,
y = prdct1_purchase_averages,
name = 'Top 1st P00265242',
mode = 'lines+markers')
trace1 = go.Scatter(x = prdct2_purchase_averages.index,
y = prdct2_purchase_averages,
name = 'Top 2nd P00110742',
mode = 'lines+markers')
trace2 = go.Scatter(x = prdct3_purchase_averages.index,
y = prdct3_purchase_averages,
name = 'Top 3rd P00025442',
mode = 'lines+markers')
trace3 = go.Scatter(x = prdct4_purchase_averages.index,
y = prdct4_purchase_averages,
name = 'Top 4th P00112142',
mode = 'lines+markers')
trace4 = go.Scatter(x = prdct5_purchase_averages.index,
y = prdct5_purchase_averages,
name = 'Top 5th P00057642',
mode = 'lines+markers')
data = [trace0,trace1,trace2,trace3,trace4]
layout = dict(title = 'Purchase Count for Top Products across Age groups',
xaxis = dict(title = 'Age group'),
yaxis = dict(title = 'Purchase Count')
)
figure = dict(data=data, layout=layout)
iplot(figure)
The above chart shows how the purchase count varies for the top 5 products acroos different age groups.
The product Id "P00265242" has been purchased 1858 times remains the top product for all age categories except "0 to 17" and "36 to 45" years age group.
The product Id "P00112142" which is in the top 4th position, remains the second or third most bought product for age categoris between 0 to 35.
But Product "P00025442" which is top 3rd in the list remains the second or third most bought product for age 36 plus.
This data may be useful for product promotion that is targetted to customers based on their age group.
city_A_observations = bf_df[bf_df['City_Category'] == 'A']
city_B_observations = bf_df[bf_df['City_Category'] == 'B']
city_C_observations = bf_df[bf_df['City_Category'] == 'C']
cityA_purchase_averages = city_A_observations.groupby("Stay_In_Current_City_Years").Purchase.sum()
cityB_purchase_averages = city_B_observations.groupby("Stay_In_Current_City_Years").Purchase.sum()
cityC_purchase_averages = city_C_observations.groupby("Stay_In_Current_City_Years").Purchase.sum()
trace0 = go.Scatter(x = cityA_purchase_averages.index,
y = cityA_purchase_averages,
name = 'City A',
mode = 'lines+markers')
trace1 = go.Scatter(x = cityB_purchase_averages.index,
y = cityB_purchase_averages,
name = 'City B',
mode = 'lines+markers')
trace2 = go.Scatter(x = cityC_purchase_averages.index,
y = cityC_purchase_averages,
name = 'City C',
mode = 'lines+markers')
data = [trace0,trace1,trace2]
layout = dict(title = 'Average Purchase Amount by City Category & Stay_In_Current_City_Years',
xaxis = dict(title = 'Stay_In_Current_City_Years'),
yaxis = dict(title = 'Total Purchase ($)')
)
figure = dict(data=data, layout=layout)
iplot(figure)
The above chart shows the relationship between City, Stay in current City & Purchase. A prominent spike in purchase total is observed when stay in current city is between 0 to 1 year. There is a sharp fall in purchase total for stay between 1 to 2 years. The Purchase total does not rise any further but starts to gradually decrease as stay in current City increase. However, for City category B this trend for purchase total seem to be gradually increase as stay in current city approaches 3 years. This graph, reveals some interesting facts from the data set.
Are there other features that could be added to the data or created from existing features? Which ones?
def map_PurchaseAmnt(purchase):
if purchase >= 20000:
return '>=20K'
elif purchase >= 15000 and purchase < 20000:
return '>=15K < 20K'
elif purchase >= 10000 and purchase < 15000:
return '>=10K < 15K'
elif purchase >= 5000 and purchase < 10000:
return '>=5K < 10K'
elif purchase >= 3000 and purchase < 5000:
return '>=3K < 5K'
elif purchase >= 1000 and purchase < 3000:
return '>=1K < 3K'
else:
return 'Less Than 1K'
bf_df['PurchaseAmntCategory'] = bf_df['Purchase'].map(map_PurchaseAmnt)
df_groupByPurchaseAmntAndCategry = bf_df.groupby(by=['Product_Category_1','PurchaseAmntCategory'])
outdflist =[]
order1 = ['>=20K','>=15K < 20K','>=10K < 15K','>=5K < 10K','>=3K < 5K','>=1K < 3K','Less Than 1K']
order = {key: i for i, key in enumerate(order1)}
print(order)
amntCatgryValues = sorted(pd.value_counts(bf_df.PurchaseAmntCategory).index.tolist(), key=order1.index, reverse=True)
print(amntCatgryValues)
for amntCatgry in amntCatgryValues:
# create a subdf with desired columns:
subdf = bf_df[bf_df['PurchaseAmntCategory'] == amntCatgry]
newSubDf = subdf[['PurchaseAmntCategory','Product_Category_1','Purchase']].copy()
outdflist.append(newSubDf)
outdfCntlist =[]
for df in outdflist:
dfCnt = df.groupby("Product_Category_1").Purchase.sum()
#print(dfCnt)
outdfCntlist.append(dfCnt)
traceList = []
index=0;
for cnt in outdfCntlist:
#print(cnt.index.tolist())
trace = go.Bar(x = cnt,
y = cnt.index,
name = amntCatgryValues[index],
orientation = 'h',
marker = dict(line = dict(width = 1)))
index +=1
traceList.append(trace)
layout = go.Layout(height=750,
barmode='stack',bargap =0.1,title = 'Total Purchase amount by Amount category across Different Product categories',
xaxis = dict(title = 'Total Purchase amount ($) by Amount category'),yaxis = dict(title = 'Product categories'))
fig = go.Figure(data=traceList, layout=layout)
py.iplot(fig, filename='marker-h-bar')
This horizontal bar chart shows the total purchase amount by amount category for all Product Category1 (1 to 18). The amount category is a new column that categorizes the row based on the Purchase amount. Below are the Amount Category created for every row in the dataset.
As shown above Product Cateory1 =1 contributes to the maximum purchase amount total. Product Category1=1, has the maximum purchase total of 1.9B, the major contribution comes from products which costs between 15 to 20K USD. These products contribute 1.26 B of total purchase which is 66% of total purchase within Product Category1=1.
The next Category with highest total purchase amount is Product Category1=5. The major contribution comes from products which costs between 5 to 10K USD. The above also applies to Product Category1=8 which has the third highest total purchase amount.
Other interesting observation here is, the products which costs more than >=20K USD are observed only in product category 1 = 6,7,10,15,16 but none of these categories have a significantly higher total purchase amount.
Also the products which is less than 1K USD are observed only in product categories1=4,12,13,18. The contribution to overall purchase amount total comes from Products that costs between 15 to 20K USD & 5 to 10K USD.
def map_ageCategory(age):
if age == '0-17':
return 'Teen'
elif age == '18-25':
return 'Early Twenties'
elif age == '26-35':
return 'Late Twenties and Early Thirties'
elif age == '36-45':
return 'Mid Thirties and Early Forties'
elif age == '46-50':
return 'Late Forties'
elif age == '51-55':
return 'Early Fifties'
else:
return 'Late Fifties Plus'
bf_df['AgeCategory'] = bf_df['Age'].map(map_ageCategory)
df_groupByAge = bf_df.groupby(by=['AgeCategory'])
print ("Percentage of Purchase in each age category:")
print (df_groupByAge.Purchase.sum() / bf_df.Purchase.sum() *100)
dfAgeCatgry = bf_df.groupby("AgeCategory").Purchase.sum().sort_values()
print(dfAgeCatgry)
trace0 = go.Scatter(
x=dfAgeCatgry,
y=dfAgeCatgry.index,
mode='markers',
marker=dict(
color=['rgb(93, 164, 214)', 'rgb(255, 144, 14)',
'rgb(44, 160, 101)', 'rgb(255, 65, 54)','rgb(25, 55, 65)','rgb(55, 25, 75)','rgb(150, 165, 254)'],
opacity=[1, 0.8,0.5, 0.6, 0.5,0.6,0.7],
size=[30,40,50,60,70,75,80],
)
)
layout = go.Layout(
title='Purchase total by Age Category',
xaxis=dict(
title='Total Purchase',
gridcolor='rgb(255, 255, 255)',
gridwidth=2,
),
yaxis=dict(
title='Age Category',
gridcolor='rgb(255, 255, 255)',
gridwidth=2,
),
margin=go.layout.Margin(
l=250,
r=50
),
paper_bgcolor='rgb(243, 243, 243)',
plot_bgcolor='rgb(243, 243, 243)',
)
data = [trace0]
fig = go.Figure(data=data,layout=layout)
py.iplot(fig,filename='bubblechart-color')
This bubble chart shows the total purchase amount by Age categories. The age category is a new column that categorizes the row based on the customer age range. Below are the Age Category created for every row in the dataset.
• Teen
• Late Fifties Plus
• Early Fifties
• Late Forties
• Early Twenties
• Mid Thirties and Early Forties
• Late Twenties and Early Thirties
The bubble chart above shows themagnitude fo Purchase amount total for different Age category. As noticed earlier, the age category "Late Twenties And Early Thirties" has an overall purchase total of 2B. This is 1 billion greater than the next age category "Mid Thirties and Early Forties". This is a significant increase in the purchase amount total between the first & second top age catgories.
Provide additional analyses. One idea: implement dimensionality reduction, then visualize and interpret the results.
fig = {
"data": [
{
"type": 'violin',
"x": bf_df['AgeCategory'] [ bf_df['Gender'] == 'M' ],
"y": bf_df['Purchase'] [ bf_df['Gender'] == 'M' ],
"legendgroup": 'M',
"scalegroup": 'M',
"name": 'M',
"box": {
"visible": True
},
"meanline": {
"visible": True
},
"line": {
"color": 'blue'
}
},
{
"type": 'violin',
"x": bf_df['AgeCategory'] [ bf_df['Gender'] == 'F' ],
"y": bf_df['Purchase'] [ bf_df['Gender'] == 'F' ],
"legendgroup": 'F',
"scalegroup": 'F',
"name": 'F',
"box": {
"visible": True
},
"meanline": {
"visible": True
},
"line": {
"color": 'pink'
}
}
],
"layout" : {
"yaxis": {
"zeroline": False,
},
"violinmode": "group"
}
}
py.iplot(fig, filename = 'violin/grouped', validate = False)